Add crawled urls #2232
Conversation
…during version bump
Walkthrough

This pull request introduces several modifications across the repository. In the workflow scripts, logic is enhanced to distinguish version changes in the package configuration. Database migrations add new tables while removing unused columns and ensuring proper constraints. Multiple projects update dependency versions and simplify ThemeProvider configurations. Enhancements to web crawler components include new UI elements, utility functions, and API endpoints for both initiating crawls and fetching their statuses. Additional type safety improvements and environment variable configurations are also introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant U as User
    participant CB as CrawlButton
    participant API as Crawl API Endpoint
    participant SCM as SCRAPER Service
    participant DB as Database
    U->>CB: Clicks "Crawl" button
    CB->>API: Sends POST request (/api/v1/workspaces/{wsId}/crawl)
    API->>API: Validate API key/session and input
    API->>DB: Query keys, secrets, and check URL existence
    API->>SCM: Request to crawl the URL
    SCM-->>API: Return crawled data
    API->>DB: Insert crawled results
    API-->>CB: Respond with success
    CB->>U: Trigger onSuccess callback (updates UI)
```

```mermaid
sequenceDiagram
    participant CC as CrawlerContent
    participant API as Status API Endpoint
    participant DB as Database
    loop Every 5 seconds
        CC->>API: Sends GET request for crawl status
        API->>DB: Query crawled_urls and crawled_url_next_urls
        DB-->>API: Return crawler status data
        API-->>CC: Respond with current status and related URLs
    end
```
Actionable comments posted: 11
🔭 Outside diff range comments (2)
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/(default)/page.tsx (1)
44-54: 💡 Verification agent 🧩 Analysis chain

Verify the impact of removing data transformation.

The data transformation step that added the `href` property has been removed. Please ensure that this property is not required by the table columns or any other components. Run the following script to check if any code relies on the `href` property:

🏁 Script executed:

```shell
#!/bin/bash
# Description: Check for usage of the `href` property in table columns and related components.

# Test: Search for `href` usage in column definitions
rg -A 5 'href' apps/web/src/app/\[locale\]/\(dashboard\)/\[wsId\]/crawlers/columns.ts

# Test: Search for `href` usage in components that might use the table data
rg -A 5 'href' apps/web/src/app/\[locale\]/\(dashboard\)/\[wsId\]/crawlers/
```

Length of output: 3378
Issue Identified: Missing `href` Property for Row Actions

The `RowActions` component in the crawlers module expects a `href` property (accessed as `row.original.href`). Our search confirms that while one cell in `columns.tsx` computes a URL using `row.getValue('id')`, the `RowActions` component still relies on a `href` property that would have been added in the removed data transformation. This discrepancy can cause the view button (or any link relying on the `href`) in the row actions not to render or work as intended.

- `RowActions` Usage: In `apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/row-actions.tsx`, the link is conditionally rendered only if `href` is provided.
- Computed URL: Although `columns.tsx` computes a URL for one of its columns, it passes `row.original.href` to `RowActions`, which will be undefined after the transformation removal.

Please verify whether this behavior is acceptable or if the code should either (a) revert the removal to ensure the `href` property is added to each data row or (b) update the `RowActions` usage to compute the `href` similar to the approach in the column definition.

apps/db/supabase/migrations/20250210132933_new_migration.sql (1)
45-103: ⚠️ Potential issue

Review security implications of broad permissions.

The migration grants full permissions (including truncate) to anonymous users. Consider restricting permissions based on actual requirements.

Consider removing unnecessary permissions, especially for anonymous users:

```diff
-grant truncate on table "public"."workspace_crawlers" to "anon";
-grant trigger on table "public"."workspace_crawlers" to "anon";
-grant references on table "public"."workspace_crawlers" to "anon";
```
🧹 Nitpick comments (12)
apps/web/src/app/api/v1/workspaces/[wsId]/crawl/route.ts (3)
11-28: Validate the URL input more thoroughly

Currently, the code only checks for the existence of `url`, but doesn't validate whether it's a valid or properly formatted URL. Consider using a stricter validation step (e.g., checking for valid protocols) to prevent passing malformed URLs to the crawler.
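A minimal validation sketch (the `isCrawlableUrl` helper below is hypothetical, not part of this PR) could rely on the `URL` constructor and an allow-list of protocols:

```ts
// Hypothetical helper: rejects malformed URLs and non-HTTP(S) protocols
// before the value is forwarded to the scraper service.
function isCrawlableUrl(value: string): boolean {
  try {
    const parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    return false; // `new URL` throws on malformed input
  }
}

// Usage inside the route handler (sketch):
// if (!url || !isCrawlableUrl(url)) {
//   return NextResponse.json({ message: 'Invalid URL' }, { status: 400 });
// }
```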
86-92: Use structured logging for errors

The code logs errors using `console.log`. In production, a structured logging approach or a centralized logging service is preferable to simplify troubleshooting and to preserve sensitive details securely.
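As one option, a structured logger such as pino (assuming it were added as a dependency; this route does not currently use it) would attach machine-readable context to each error:

```ts
import pino from 'pino';

// One shared logger; fields in the first argument become structured
// JSON properties instead of free-form text.
const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

export function logCrawlError(error: unknown, wsId: string) {
  logger.error(
    { err: error, wsId, route: 'workspaces/crawl' },
    'Crawl request failed'
  );
}
```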
94-155: Check for missing or invalid `SCRAPER_URL` in environment

The code fetches `process.env.SCRAPER_URL` without verifying its presence or validity. If `SCRAPER_URL` is not set, it may produce an unhelpful runtime error. Consider returning a 500 status with a clear message if this environment variable is missing.
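A minimal guard, sketched below with a simplified handler body, would fail fast with an actionable message:

```ts
import { NextResponse } from 'next/server';

// Sketch: validate the scraper endpoint before doing any work.
export async function POST() {
  const scraperUrl = process.env.SCRAPER_URL;

  if (!scraperUrl) {
    // Surface a clear error instead of an opaque fetch failure downstream.
    return NextResponse.json(
      { message: 'SCRAPER_URL environment variable is not configured' },
      { status: 500 }
    );
  }

  // ...continue with the crawl request using `scraperUrl`...
  return NextResponse.json({ ok: true });
}
```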
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/crawl-button.tsx (2)

6-14: Ensure consistent error handling or user feedback

The component logs errors to the console but does not provide user-visible feedback. For a better user experience, consider adding a notification/toast or UI indicator if crawling fails.
17-46: Handle non-OK responses more robustly

Currently, the code simply logs an error if `res.ok` is false and then returns. If the server returns a 400 or 403, it might be helpful to parse the JSON body to show a user-friendly message explaining why the crawl failed.
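For instance, the handler could try to read the server-provided reason before giving up; in this sketch the response shape is an assumption, not a guarantee of the API:

```ts
// Sketch: surface the server-provided reason when the crawl request fails.
async function readErrorMessage(res: Response): Promise<string> {
  try {
    const body = (await res.json()) as { message?: string };
    return body.message ?? `Request failed with status ${res.status}`;
  } catch {
    // Body was empty or not JSON; fall back to the status code.
    return `Request failed with status ${res.status}`;
  }
}

// Usage (sketch):
// if (!res.ok) {
//   const reason = await readErrorMessage(res);
//   console.error(reason); // or show it in a toast
//   return;
// }
```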
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/utils.ts (1)

20-53: Handle self-closing tags and complex HTML structures in `formatHTML`

The current approach incrementally adjusts `depth` and relies on `<` / `/>` detection. Self-closing tags, inline elements, or malformed HTML can disrupt the formatted output. Consider leveraging a proper HTML parser for a more reliable formatting approach if your use case demands it.
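If a heavier dependency is acceptable, an HTML-aware formatter can replace the manual depth tracking. A sketch using prettier's HTML parser (assuming prettier is available to the app, which this PR does not establish):

```ts
import * as prettier from 'prettier';

// Sketch: let an HTML-aware formatter handle nesting, self-closing tags,
// and malformed markup instead of tracking depth manually.
export async function formatHTML(html: string): Promise<string> {
  try {
    return await prettier.format(html, { parser: 'html' });
  } catch {
    // Fall back to the raw input if the markup cannot be parsed.
    return html;
  }
}
```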
apps/web/src/app/api/v1/workspaces/[wsId]/crawlers/status/route.ts (1)

15-34: Enhance error handling with detailed messages.

Include the error details in the response for better debugging while ensuring no sensitive information is exposed.

```diff
 if (crawledUrlError) {
   return NextResponse.json(
-    { message: 'Error fetching crawled URL' },
+    {
+      message: 'Error fetching crawled URL',
+      error: crawledUrlError.message || 'Unknown error'
+    },
     { status: 500 }
   );
 }
```

apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/(default)/page.tsx (1)
59-101: Implement search functionality.

The `getData` function accepts a search query parameter `q` but doesn't utilize it in the query builder. Consider implementing search functionality to filter the crawlers based on the query.

Apply this diff to add search functionality:

```diff
 const queryBuilder = supabase
   .from('workspace_crawlers')
   .select('*')
   .order('created_at', { ascending: true })
   .eq('ws_id', wsId);

+if (q) {
+  queryBuilder.ilike('url', `%${q}%`);
+}

 if (page && pageSize) {
```
142-177: Enhance URL display with truncation and copy functionality.

Long URLs may overflow the container, and there's no easy way to copy them.

Apply this diff to improve the URL display:

```diff
+'use client';
 import CrawlButton from './crawl-button';
 import { formatHTML, unescapeMarkdownString } from './utils';
 import { MemoizedReactMarkdown } from '@/components/markdown';
+import { Copy } from 'lucide-react';
+import { toast } from 'sonner';

 // ...

 <span
   className={
     relatedUrl.skipped ? 'text-muted-foreground' : ''
   }
 >
-  {relatedUrl.url}
+  <div className="flex items-center gap-2">
+    <span className="truncate max-w-[500px]" title={relatedUrl.url}>
+      {relatedUrl.url}
+    </span>
+    <button
+      onClick={() => {
+        navigator.clipboard.writeText(relatedUrl.url);
+        toast.success('URL copied to clipboard');
+      }}
+      className="p-1 hover:bg-muted rounded-md"
+    >
+      <Copy className="h-4 w-4" />
+    </button>
+  </div>
 </span>
```
129-134: Consider adding indexes for URL lookups.

The `crawled_urls` and `crawled_url_next_urls` tables will likely be queried by URL frequently. Consider adding indexes to improve query performance.

```diff
 create table "public"."crawled_urls" (
   "id" uuid not null default gen_random_uuid(),
   "url" text not null,
   "html" text,
   "markdown" text,
   "created_at" timestamp with time zone not null default now()
 );

+CREATE INDEX idx_crawled_urls_url ON public.crawled_urls USING btree (url);

 create table "public"."crawled_url_next_urls" (
   "origin_id" uuid not null default gen_random_uuid(),
   "url" text not null,
   "skipped" boolean not null,
   "created_at" timestamp with time zone not null default now()
 );

+CREATE INDEX idx_crawled_url_next_urls_url ON public.crawled_url_next_urls USING btree (url);
```

Also applies to: 136-142
.github/workflows/check-and-bump-versions.yaml (2)
296-310: Improve code maintainability.

While the logic is correct, consider these improvements for better maintainability:

- Extract magic number "1" as a constant
- Store "package.json" as a variable
- Reduce nesting by early returns

Apply this diff to improve the code:

```diff
+          # Constants
+          PACKAGE_JSON="package.json"
+          SINGLE_FILE=1
+
           # If only package.json changed, check if it's only version changes
-          if [ "$(echo "$SOURCE_FILES_CHANGED" | wc -l)" -eq 1 ] && [ "$(echo "$SOURCE_FILES_CHANGED" | grep "package.json$")" ]; then
-            if check_package_json_changes; then
-              echo "Only version field changed in package.json, skipping version bump"
-              SHOULD_BUMP=false
-            else
-              echo "Package.json has meaningful changes"
-              SHOULD_BUMP=true
-            fi
-          else
-            echo "Source files changed:"
-            echo "$SOURCE_FILES_CHANGED"
-            SHOULD_BUMP=true
-          fi
+          # Set default to true for source changes
+          SHOULD_BUMP=true
+
+          # Check if only package.json changed
+          if [ "$(echo "$SOURCE_FILES_CHANGED" | wc -l)" -eq $SINGLE_FILE ] && \
+             [ "$(echo "$SOURCE_FILES_CHANGED" | grep "$PACKAGE_JSON$")" ] && \
+             check_package_json_changes; then
+            echo "Only version field changed in $PACKAGE_JSON, skipping version bump"
+            SHOULD_BUMP=false
+            return
+          fi
+
+          echo "Source files changed:"
+          echo "$SOURCE_FILES_CHANGED"
```
271-271: Remove trailing spaces.

Remove trailing spaces to maintain consistent code style.

Apply this diff to fix the trailing spaces:

```diff
-          DIFF=$(git diff HEAD~1 HEAD -- package.json) 
-          
+          DIFF=$(git diff HEAD~1 HEAD -- package.json)
+
           # Count number of changed lines that aren't the version field
-          NON_VERSION_CHANGES=$(echo "$DIFF" | grep -v '"version":' | grep '^[+-]' | wc -l) 
-          
+          NON_VERSION_CHANGES=$(echo "$DIFF" | grep -v '"version":' | grep '^[+-]' | wc -l)
+
```

Also applies to: 274-274
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 271-271: trailing spaces
(trailing-spaces)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (23)
- .github/workflows/check-and-bump-versions.yaml (2 hunks)
- apps/db/supabase/migrations/20250210132933_new_migration.sql (1 hunks)
- apps/external/package.json (1 hunks)
- apps/mira/package.json (1 hunks)
- apps/mira/src/components/providers.tsx (1 hunks)
- apps/nova/package.json (1 hunks)
- apps/nova/src/app/[locale]/(dashboard)/[wsId]/layout.tsx (2 hunks)
- apps/nova/src/components/theme-provider.tsx (0 hunks)
- apps/rewise/src/components/providers.tsx (1 hunks)
- apps/web/.env.example (1 hunks)
- apps/web/package.json (1 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/(default)/page.tsx (1 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/crawl-button.tsx (1 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/crawler-content.tsx (1 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/page.tsx (2 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/utils.ts (1 hunks)
- apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/columns.tsx (1 hunks)
- apps/web/src/app/api/v1/workspaces/[wsId]/crawl/route.ts (1 hunks)
- apps/web/src/app/api/v1/workspaces/[wsId]/crawlers/status/route.ts (1 hunks)
- apps/web/src/components/providers.tsx (1 hunks)
- packages/supabase/src/next/server.ts (1 hunks)
- packages/types/src/supabase.ts (1 hunks)
- turbo.json (1 hunks)
💤 Files with no reviewable changes (1)
- apps/nova/src/components/theme-provider.tsx
✅ Files skipped from review due to trivial changes (2)
- apps/external/package.json
- apps/web/.env.example
🧰 Additional context used
🪛 YAMLlint (1.35.1)
.github/workflows/check-and-bump-versions.yaml
[error] 271-271: trailing spaces
(trailing-spaces)
[error] 274-274: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (16)
apps/mira/src/components/providers.tsx (1)
9-9: LGTM! Theme options simplified.

The reduction to three core themes ('system', 'light', 'dark') aligns with the broader theme simplification across the application.
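For reference, the simplified setup looks roughly like this next-themes sketch (the wrapper name and surrounding props are illustrative, not copied from the repo):

```tsx
import { ThemeProvider } from 'next-themes';
import type { ReactNode } from 'react';

// Illustrative wrapper: only the three core themes are exposed.
export function Providers({ children }: { children: ReactNode }) {
  return (
    <ThemeProvider
      attribute="class"
      themes={['system', 'light', 'dark']}
      defaultTheme="system"
      enableSystem
    >
      {children}
    </ThemeProvider>
  );
}
```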
apps/web/src/components/providers.tsx (1)
9-9: LGTM! Consistent theme implementation.

The theme configuration matches the implementation in other applications, maintaining consistency across the codebase.
apps/rewise/src/components/providers.tsx (1)
10-10: LGTM! Theme configuration properly implemented.

The theme array matches other applications while maintaining its specific dark theme default.
apps/nova/src/app/[locale]/(dashboard)/[wsId]/layout.tsx (1)
4-4: LGTM! Theme provider properly migrated.

The migration from custom theme provider to direct usage of next-themes is properly implemented with consistent configuration.

Let's verify the removal of the custom theme provider:

Also applies to: 21-27

❌ Verification inconclusive

🏁 Script executed:

```shell
#!/bin/bash
# Check if the old theme provider file has been removed
rg -l "theme-provider" apps/nova/src/components/
```

Length of output: 48

LGTM – Theme provider migration verified.

The search for "theme-provider" in the components directory returned no results, suggesting that the old custom theme provider has been removed. The import in `apps/nova/src/app/[locale]/(dashboard)/[wsId]/layout.tsx` now correctly uses `next-themes` with configuration matching elsewhere in the codebase. Please perform a manual verification for any residual references if needed.
apps/web/src/app/api/v1/workspaces/[wsId]/crawl/route.ts (1)
5-10: Confirm the need for `Promise` in params

The `interface Params` uses `params: Promise<{ wsId: string }>;`, which is unusual. Typically, Next.js handlers expect plain objects rather than Promises. Verify that returning a Promise for `params` is intentional, as it may complicate or delay extracting `wsId`.
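For context, if the app is on Next.js 15, route-handler `params` are delivered asynchronously and a Promise-typed interface is the expected shape; a minimal sketch of that pattern (handler body simplified):

```ts
import { NextResponse } from 'next/server';

interface Params {
  params: Promise<{ wsId: string }>;
}

// In Next.js 15, dynamic route params must be awaited before use.
export async function POST(_req: Request, { params }: Params) {
  const { wsId } = await params;
  return NextResponse.json({ wsId });
}
```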
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/utils.ts (1)

1-18: Confirm edge-case handling for escaped sequences

`unescapeMarkdownString` replaces well-known escape sequences, but edge cases like `\\\n` or nested escapes may not be addressed. Verify if more robust or specialized handling is required for your use case.
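A single-pass, table-driven variant is one way to avoid ordering pitfalls of chained replacements; the mapping in this sketch is illustrative and would need to match the escapes the crawler actually emits:

```ts
// Sketch: resolve each escape in one pass so earlier replacements
// cannot create new sequences that later replacements re-interpret.
const ESCAPES: Record<string, string> = {
  '\\n': '\n',
  '\\t': '\t',
  '\\"': '"',
  "\\'": "'",
  '\\\\': '\\',
};

export function unescapeMarkdownString(input: string): string {
  return input.replace(/\\(\\|n|t|"|')/g, (match) => ESCAPES[match] ?? match);
}
```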
49-53: LGTM!

The response format is clean and handles null values appropriately.
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/page.tsx (2)
39-54: LGTM!

The component structure is clean and props are passed correctly.

13-38: 🛠️ Refactor suggestion

Optimize data fetching and enhance error handling.

- Run queries in parallel to improve performance
- Add error handling for subsequent queries

```diff
 const { wsId, crawlerId } = await params;
 const supabase = await createClient();

-const { data: crawler } = await supabase
-  .from('workspace_crawlers')
-  .select('*')
-  .eq('id', crawlerId)
-  .single();
+const [crawlerResult, crawledUrlResult] = await Promise.all([
+  supabase
+    .from('workspace_crawlers')
+    .select('*')
+    .eq('id', crawlerId)
+    .single(),
+  supabase
+    .from('crawled_urls')
+    .select('*')
+    .eq('url', crawler.url)
+    .maybeSingle(),
+]);

-if (!crawler) notFound();
+if (crawlerResult.error) throw new Error('Failed to fetch crawler');
+if (!crawlerResult.data) notFound();
+const crawler = crawlerResult.data;

-const { data: crawledUrl } = await supabase
-  .from('crawled_urls')
-  .select('*')
-  .eq('url', crawler.url)
-  .maybeSingle();
+if (crawledUrlResult.error) throw new Error('Failed to fetch crawled URL');
+const crawledUrl = crawledUrlResult.data;
```

Likely an incorrect or invalid review comment.
packages/supabase/src/next/server.ts (1)
4-4: LGTM! Good type safety improvement.

Replacing `any` with `ReadonlyRequestCookies` improves type safety and makes the code more maintainable.

Also applies to: 7-7
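As a side note, the type can be derived without importing Next.js internals; the `readSessionCookie` helper and cookie name in this sketch are hypothetical:

```ts
import { cookies } from 'next/headers';

// Derive the cookie-store type from next/headers instead of using `any`.
type ReadonlyRequestCookies = Awaited<ReturnType<typeof cookies>>;

// Illustrative consumer of the typed cookie store ('sb-session' is an
// example cookie name, not the repo's actual key).
export function readSessionCookie(cookieStore: ReadonlyRequestCookies) {
  return cookieStore.get('sb-session')?.value;
}
```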
turbo.json (1)
27-27: LGTM! Addition of SCRAPER_URL environment variable.

The addition of the `SCRAPER_URL` environment variable aligns well with the PR's objective of adding crawled URLs functionality.
111-111: Verify jotai version update compatibility.

The jotai package is being updated from ^2.11.3 to ^2.12.0. Let's verify this version exists and check for any breaking changes.
✅ Verification successful
🌐 Web query:
What's new in jotai version 2.12.0? Are there any breaking changes?
💡 Result:
Jotai's release history up to version 2.11.3 shows significant internal improvements but no documentation exists in these sources about version 2.12.0 specifically[1][2]. Based on the available information:
Version Patterns:
- Recent releases (v2.11.x) focused on internal architecture changes and stability fixes rather than public API updates[2].
- The project follows semantic versioning, so minor versions (e.g., 2.X) typically avoid breaking changes unless explicitly noted[1][4].
Breaking Changes:
Major breaking changes were introduced in earlier v2 releases:
- Async atoms now require explicit promise handling (`await get(asyncAtom)` instead of automatic resolution)[1][3]
- Removed deprecated APIs like `initialValues` and `scope` props in favor of `store`-based patterns[1]
- TypeScript type updates for writable atoms (`WritableAtom<Value, Args, Result>`)[1]

For v2.12.x specifically, check Jotai's official GitHub Releases page or npm registry entries (not included in these sources), as minor patches after v2.11.x likely address edge cases rather than introduce breaking changes[2][4].
Citations:
- 1: https://jotai.org/docs/guides/migrating-to-v2-api
- 2: https://github.com/pmndrs/jotai/releases
- 3: RFC: New Jotai v2 API pmndrs/jotai#1514
- 4: https://jotai.org/docs
- 5: https://jotai.org/docs/core/atom
- 6: https://webprogrammingforappsandservices.sdds.ca/Managing-Application-State/introducing-jotai
- 7: https://github.com/jotaijs/jotai-signal/blob/main/CHANGELOG.md
- 8: https://github.com/jotaijs/jotai-cache/blob/main/CHANGELOG.md
- 9: https://jotai.org
- 10: https://www.npmjs.com/package/react-native-mmkv/v/2.12.0
Jotai version update verified as compatible
Based on the latest web query, the upgrade from ^2.11.3 to ^2.12.0 is a minor version bump following semantic versioning. There is no official documentation indicating breaking changes for 2.12.0, and the expected focus on stability fixes/internal improvements suggests compatibility.
- Confirmed that the update adheres to semantic versioning.
- No unexpected breaking changes are reported for 2.12.0.
apps/mira/package.json (1)
111-111: Same jotai version update as in nova package.

This change mirrors the jotai version update in the nova package, maintaining consistency across the project.
apps/db/supabase/migrations/20250210132933_new_migration.sql (2)
1-8: Verify data migration plan for removed columns.

The removal of `html_ids`, `type`, and `url` columns from `workspace_datasets` could affect existing data. Ensure there's a data migration plan in place.
: LGTM! Well-structured workspace_crawlers table.The new table includes all necessary fields and appropriate constraints for managing crawled content.
apps/web/package.json (1)
119-119
: Same jotai version update as in other packages.This change completes the coordinated jotai version update across all packages.
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/crawler-content.tsx
Show resolved
Hide resolved
apps/web/src/app/[locale]/(dashboard)/[wsId]/crawlers/[crawlerId]/crawler-content.tsx
Show resolved
Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary by CodeRabbit
New Features
Enhancements